
feat(BA-5528): add deployment chat CLI for vLLM-backed model services#11344

Merged
fregataa merged 57 commits into main from feat/BA-5528-deployment-chat-cli
May 6, 2026

Conversation

@jopemachine
Member

@jopemachine jopemachine commented Apr 27, 2026

📚 Stacked PRs

This PR is part of a 2-PR stack. Merge in order:

  1. 👉 feat(BA-5528): add deployment chat CLI for vLLM-backed model services #11344 ← you are here
  2. ⬇️ feat(BA-5903): persist deployment chat history and replay as request context #11412

Summary

  • Add ./bai deployment chat <id> "<content>" for one-shot OpenAI-compatible chat against deployed inference services. Requests are sent directly to the deployment's inference endpoint with optional Authorization: Bearer <token> (the value the runtime — vLLM/SGLang/NIM/TGI/custom — was started with), bypassing the Backend.AI manager. Use --params to forward runtime-variant-specific sampling knobs.
  • Add ./bai deployment chat-config set/show/clear/clear-cache to register, inspect, and remove per-deployment chat state.
  • Auto-derive the request model when the user did not specify one: the CLI calls GET /v1/models on the inference endpoint, picks data[0].id, and caches it as cache.default_model for subsequent calls (matches the webui ChatCard.tsx fallback). The user is no longer required to run chat-config set --model before the first chat.
  • Persist state under ~/.backend.ai/deployment_chat/, grouped per-feature (matching the existing ~/.backend.ai/session/ layout used by ./bai login):
    • cache.json — auto-managed: endpoint_url, default_model (auto-fetched from /v1/models), last_synced_at (24-h TTL).
    • config.json — user-managed: per-deployment { token, model } entries. The user's model takes precedence over cache.default_model.
  • Both files are written via plain path.write_text() to match the existing CLI credential-storage convention (client/cli/v2/config_cmd.py). On 401/403 from the inference endpoint, the cached token for that deployment is cleared and the user is prompted to re-register.
  • Add an SDK-side BackendAIAppProxyClient base in client/v2/base_client.py for direct-to-deployment HTTP traffic (Bearer-token auth, app-proxy-aware JSON parsing) and a thin DeploymentChatClient subclass exposing chat_completion() and list_models() (returning a typed ListModelsResponse).

Model resolution order

When the runtime needs a model field for a chat call, the CLI walks this list and stops at the first hit:

  1. --model <name> on the chat command line.
  2. config.<deployment-id>.model — the user's pinned model in config.json.
  3. cache.<deployment-id>.default_model — the auto-derived value in cache.json.
  4. GET /v1/models on the inference endpoint, taking data[0].id. The result is written to cache.default_model so subsequent calls skip the round trip.

This means a fresh deployment works with zero configuration as long as the runtime serves /v1/models; you only need chat-config set --model for multi-model deployments where [0] is not the right pick.
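
The resolution order above can be sketched as a small pure function. This is a minimal illustration, not the CLI's actual code: `resolve_model` and `fetch_first_model_id` are hypothetical names, and the callable stands in for the `GET /v1/models` round trip.

```python
from typing import Callable, Optional


def resolve_model(
    cli_model: Optional[str],
    config_model: Optional[str],
    cached_default: Optional[str],
    fetch_first_model_id: Callable[[], str],
) -> tuple[str, bool]:
    """Return (model, fetched); `fetched` is True only when step 4 ran."""
    # Steps 1-3: the first non-empty local value wins, in priority order.
    for candidate in (cli_model, config_model, cached_default):
        if candidate:
            return candidate, False
    # Step 4: ask the endpoint (hypothetical callable); the caller is
    # expected to write the result back to cache.default_model.
    return fetch_first_model_id(), True
```

Returning the `fetched` flag lets the caller decide when to write the cache, so the function itself stays free of I/O.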

Command usage

# One-shot chat — model is auto-derived on first call from /v1/models
./bai deployment chat <deployment-id> "Hello, who are you?"

# Override the model for one call
./bai deployment chat <deployment-id> "..." --model llama-3-8b-instruct

# Forward runtime-specific sampling knobs as a JSON object
./bai deployment chat <deployment-id> "..." \
    --params '{"temperature": 0.7, "max_tokens": 256}'

# Register a Bearer token for a token-gated deployment
./bai deployment chat-config set <deployment-id> --token <runtime-token>

# Pin a model (overrides the cached default; useful for multi-model deployments)
./bai deployment chat-config set <deployment-id> --model llama-3-8b-instruct

# Set both at once
./bai deployment chat-config set <deployment-id> \
    --token <runtime-token> --model llama-3-8b-instruct

# Inspect what's currently registered/cached (token is masked)
./bai deployment chat-config show <deployment-id>

# Remove the user-managed config entry (token + model) for a deployment
./bai deployment chat-config clear <deployment-id>

# Force-invalidate the auto-managed cache entry (endpoint_url, default_model)
./bai deployment chat-config clear-cache <deployment-id>

chat-config set writes to config.json only — it does not contact the manager, so it stays usable while the deployment is still provisioning or the manager is unreachable. chat-config clear and clear-cache operate on the two storage files independently: clearing user config never touches the cache, and vice versa.

On-disk state

State lives under ~/.backend.ai/deployment_chat/ so it stays grouped with the other Backend.AI CLI state directories.

cache.json — auto-managed by the CLI; do not hand-edit.

{
  "deployments": {
    "d55e251a-3a70-408d-97a9-ca305502aa58": {
      "endpoint_url": "https://app-proxy.example.com/v1/some-deployment",
      "default_model": "llama-3-8b-instruct",
      "last_synced_at": "2026-04-29T12:34:56.789012+00:00"
    }
  }
}
  • endpoint_url — fetched from the manager's deployment.network_access.endpoint_url and refreshed on a 24-hour TTL.
  • default_model — auto-derived from GET /v1/models on first use; never written by chat-config set.
  • last_synced_at — UTC timestamp of the last manager fetch; entries past CACHE_ENTRY_TTL (24 h) are treated as a cache miss.
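
A minimal sketch of the TTL check, assuming `last_synced_at` is stored as an ISO 8601 string as in the example above. The function signature is an assumption (the PR only names `is_expired()` and `CACHE_ENTRY_TTL`), and whether an entry exactly at the 24-hour mark counts as expired is a guess based on the word "past".

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Name taken from the PR description; the value is the stated 24 h.
CACHE_ENTRY_TTL = timedelta(hours=24)


def is_expired(last_synced_at: str, now: Optional[datetime] = None) -> bool:
    """Treat an entry as a cache miss once it is older than the TTL."""
    synced = datetime.fromisoformat(last_synced_at)
    current = now or datetime.now(timezone.utc)
    # Strictly greater: an entry exactly at the TTL is still fresh (assumption).
    return current - synced > CACHE_ENTRY_TTL
```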

config.json — user-managed: one { token, model } entry per deployment.

{
  "deployments": {
    "d55e251a-3a70-408d-97a9-ca305502aa58": {
      "token": "sk-runtime-token-here",
      "model": "llama-3-8b-instruct"
    }
  }
}

Either field may be null. chat-config set upserts only the fields you pass, and an entry is dropped automatically once both fields are cleared. The token is also cleared automatically on 401/403 from the inference endpoint so the next chat call surfaces the re-register hint instead of silently re-sending a stale credential.
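
The upsert and auto-drop behavior described here could look roughly like the following. It is a sketch over plain dicts, not the PR's Pydantic models; `upsert` and `clear_field` are hypothetical helper names.

```python
from typing import Optional

Entry = dict[str, Optional[str]]


def upsert(store: dict[str, Entry], deployment_id: str,
           token: Optional[str] = None, model: Optional[str] = None) -> None:
    """Write only the fields that were passed; leave the others untouched."""
    if token is None and model is None:
        return  # nothing to write, so do not create an empty entry
    entry = store.setdefault(deployment_id, {"token": None, "model": None})
    if token is not None:
        entry["token"] = token
    if model is not None:
        entry["model"] = model


def clear_field(store: dict[str, Entry], deployment_id: str, field: str) -> bool:
    """Null one field; drop the entry once both fields are unset."""
    entry = store.get(deployment_id)
    if entry is None or entry.get(field) is None:
        return False  # nothing was removed
    entry[field] = None
    if entry["token"] is None and entry["model"] is None:
        del store[deployment_id]
    return True
```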

Resolves BA-5528.

Copilot AI review requested due to automatic review settings April 27, 2026 07:49
@github-actions github-actions Bot added size:XL 500~ LoC comp:client Related to Client component comp:cli Related to CLI component labels Apr 27, 2026
@jopemachine jopemachine marked this pull request as draft April 27, 2026 07:49
Contributor

Copilot AI left a comment

Pull request overview

Adds a new CLI workflow for one-shot OpenAI-compatible chat calls against deployed vLLM inference endpoints, backed by a per-deployment local cache and a dedicated SDK-side HTTP client/DTOs (bypassing the Backend.AI manager API).

Changes:

  • Add ./bai deployment chat and ./bai deployment chat-config set/show/clear commands plus a JSON cache at ~/.backend.ai/deployment_chat.json (0600, atomic write).
  • Add DeploymentChatClient (direct aiohttp client) and OpenAI-compatible Pydantic DTOs under the v2 client package.
  • Add unit tests for the direct chat client and the cache load/save semantics.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tests/unit/client/v2/test_deployment_chat_client.py | Unit tests for direct vLLM chat posting, auth error handling, serialization, and session ownership. |
| tests/unit/client/cli/test_deployment_chat_cache.py | Unit tests for cache schema/version guard, permissions, atomic write, masking, and tolerant loading. |
| src/ai/backend/client/v2/domains_v2/deployment_chat.py | New direct-to-inference chat client (aiohttp) with OpenAI-compatible request/response handling. |
| src/ai/backend/client/v2/chat_dto.py | New Pydantic DTOs for /v1/chat/completions request/response payloads with forward-compatible extra fields. |
| src/ai/backend/client/cli/v2/deployment_chat_cache.py | New cache implementation for endpoint URL + vLLM API key persistence with 0600 permissions and atomic writes. |
| src/ai/backend/client/cli/v2/deployment/chat_config.py | New chat-config CLI group to set/show/clear cache entries. |
| src/ai/backend/client/cli/v2/deployment/chat.py | New chat CLI command to send one-shot chat completions and invalidate cached key on 401/403. |
| src/ai/backend/client/cli/v2/deployment/__init__.py | Registers the new chat and chat-config commands under deployment. |
| changes/5528.feature.md | Changelog entry for the new CLI commands and cache behavior. |

Comment thread src/ai/backend/client/cli/v2/deployment_chat_cache.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment_chat_cache.py Outdated
Comment thread src/ai/backend/client/v2/domains_v2/deployment_chat.py Outdated
Comment thread src/ai/backend/client/v2/domains_v2/deployment_chat.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat_config.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat_config.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat_config.py Outdated
Comment thread src/ai/backend/client/v2/chat_dto.py Outdated
Comment thread src/ai/backend/client/v2/domains_v2/deployment_chat.py Outdated
Comment thread tests/unit/client/cli/test_deployment_chat_cache.py Outdated
Comment thread tests/unit/client/cli/test_deployment_chat_cache.py Outdated
Comment thread changes/11344.feature.md Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat.py Outdated
Comment thread src/ai/backend/client/v2/chat_dto.py Outdated
Comment thread src/ai/backend/client/v2/chat_dto.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat.py Outdated
jopemachine added a commit that referenced this pull request Apr 27, 2026
Address review comments from #11344:

- Drop chat_dto.py and switch the SDK to plain dict[str, Any] for both
  request and response, so it doesn't try to track every runtime
  variant's extension fields (vllm reasoning_content, tool_calls, etc.)
- Rename DeploymentChatClient -> InferenceChatClient and decouple it
  from the vllm runtime variant: works against any OpenAI-compatible
  endpoint (vllm, tgi, sglang, nim) and exposes a configurable path
  plus a list_models helper
- Rename the cached api key field vllm_api_key -> api_key throughout
  the cache schema, CLI options, show output, and tests
- chat-config set: --token is now optional and pairs with a new
  --no-token flag for deployments started without --api-key. The
  served model name is auto-discovered via GET /v1/models (option B
  from the discussion) so users no longer have to know it
- chat: replace the local _abort helper with click.ClickException,
  validate --max-tokens via click.IntRange(min=1) and the sampling
  knobs via click.FloatRange, and add --top-p, --frequency-penalty,
  --presence-penalty, --seed, --stop options
- inference_chat client: add ClientTimeout (sock_connect/sock_read)
  to the owned aiohttp session and normalize trailing slashes when
  building the chat / models URL
- cache loader: tolerate corrupted JSON (OSError/JSONDecodeError) and
  skip individual malformed entries instead of aborting the whole load
- tests: drop redundant atomic-write/permission-reset cases, add
  loader resilience cases, and shorten the changelog entry
Comment thread src/ai/backend/client/cli/v2/deployment/chat.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat_config.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat/commands.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat_config.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat_config.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat/commands.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment_chat_cache.py Outdated
jopemachine added a commit that referenced this pull request Apr 28, 2026
Address review comments on PR #11344:

- chat.py:
  - Drop the auto-clear of the cached API key on inference 401/403 — it
    was deleting user-supplied config out from under them. Just raise
    the error and ask the user to re-register.
  - Use print() instead of sys.stdout.write() for the response payload.
- chat_config.py:
  - Remove --no-token; clearing is the dedicated chat-config clear
    command's job. Resolved-key handling collapses to a single expression.
  - Use print() instead of click.echo() for status output.
  - Parse the inference endpoint's /v1/models response with a typed
    Pydantic model (_ServedModelsResponse) instead of manual dict.get
    walking.
  - _print_entry now delegates the entry portion to
    DeploymentChatCacheEntry.format_summary() so the per-entry fields
    are owned by the cache type.
- deployment_chat_cache.py / deployment_chat_config.py:
  - Drop schema_version as a Pydantic field on the wrapper model. The
    version is metadata, not data — emit it manually around model_dump
    in save_*, and check it manually in load_* before validating
    individual records.
- DeploymentChatCacheEntry gains a format_summary() method returning the
  endpoint/default_model/last_synced_at lines so consumers don't
  duplicate that formatting.
Comment thread src/ai/backend/client/v2/deployment_chat.py Outdated
Comment thread src/ai/backend/client/v2/deployment_chat.py Outdated
Comment thread src/ai/backend/client/v2/deployment_chat.py Outdated
Comment thread tests/unit/client/cli/test_deployment_chat_utils.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat/commands.py Outdated
jopemachine added a commit that referenced this pull request Apr 28, 2026
…Args type

Address review comments on PR #11344:

- Drop _owns_session and the optional session= kwarg on
  DeploymentChatClient. Match BackendAIAuthClient: __init__ takes a
  pre-built session, factory method create() builds one, close() always
  closes. Removes the dual-ownership branch.
- Introduce DeploymentChatClientArgs (frozen dataclass) for connection
  knobs (skip_ssl_verification, connect_timeout, read_timeout).
  Callers use DeploymentChatClient.create(args) instead of passing
  multiple kwargs to the constructor.
- Rename chat_completion's 'request' parameter to 'body'.
- Tests: rename the cache-entry helper to _make_entry, the chat-body
  helper to _make_body. Drop TestExternalSession since the new
  contract is 'whatever you pass to __init__ gets closed'.
Comment thread src/ai/backend/client/cli/v2/deployment/chat/commands.py Outdated
Comment thread src/ai/backend/client/cli/v2/deployment/chat/utils.py Outdated
jopemachine and others added 26 commits May 6, 2026 13:59
The cache file holds endpoint URL, model name and a sync timestamp —
no secrets. The 0600 chmod was copy-pasted from the config file path
where it actually matters (plaintext API keys). Default umask applies
to the cache; only save_chat_config keeps the chmod. Module/function
docstrings updated and the corresponding cache permission test goes
away.
…oning

Token registration is purely user-side state — it should not block on
the deployment's runtime status. Previously set_ went through
_resolve_endpoint_entry which raises 'no endpoint_url yet' when the
deployment is in DEPLOYING/PROVISIONING, dropping the user's token
along with the cache write.

Restructure set_:
1. Always fetch the deployment record (so a typo in deployment_id still
   surfaces a 404).
2. Save the token unconditionally when --token is provided.
3. Write the cache entry only when endpoint_url is already populated;
   otherwise warn that --default-model will be picked up on the first
   chat call once the deployment is READY.

The chat command's _resolve_endpoint_entry is unchanged — chat still
requires a usable endpoint to talk to.
- DeploymentChatCache/Config gain `save()` instance methods (paired with
  the existing `load()` classmethods); free functions in utils.py removed.
- `_write_text_file` writes via tmp+rename and creates the file with the
  target permission directly, closing the brief world-readable window
  that `write_text() + chmod(0600)` left open on the config file.
- `is_fresh()` flipped to `is_expired()` to align with the cache miss
  call site.
- `_resolve_endpoint_entry` had a single caller and an unused
  `default_model_override` parameter; inlined into `chat`.
- Renamed local `connection` to `connection_config` to match
  `V2ConnectionConfig`.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…ish naming

- 401/403 from the inference endpoint now clears the stored API key for
  that deployment so the user is not silently retried with a known-bad
  token. The error message tells the user the cache was cleared.
- Replace the ad-hoc ``dict[str, Any]`` chat body with
  ``ChatCompletionRequest`` (pydantic, ``extra="allow"``) so runtime-
  variant-specific knobs supplied via ``--params`` still flow through
  while the model/messages shape is enforced.
- Rename ``chat_config_store`` → ``chat_config`` in the ``chat`` command
  and ``config`` inside the ``chat-config`` subcommands to match the
  reviewer's preferred naming and avoid shadowing the click group.
- Clarify ``_ensure_dict`` wording: payloads that are valid JSON but not
  an object now report ``non-object payload (type=...)`` instead of the
  misleading ``non-JSON response``.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…der, named timeout consts

- Rename ``api_key`` to ``token`` across CLI flag binding, local
  variables, client method signatures, error messages, and the
  ``chat-config show`` summary label so the user-facing vocabulary
  matches the storage method names (``get_token``/``set_token``).
- Replace the length-leaking ``sk-***...***xxxx``-style mask with a
  fixed ``********`` placeholder that never reveals the token's
  prefix, suffix, or length.
- Pull ``DeploymentChatClientArgs`` magic numbers into named module
  constants (``DEFAULT_CONNECT_TIMEOUT_SEC``, ``DEFAULT_READ_TIMEOUT_SEC``).
- Update the affected test names and assertions accordingly.
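
For illustration, a fixed-placeholder mask could be as small as the sketch below. The `mask_token` name comes from a later commit in this PR; the "(not set)" label for a missing token is an assumption.

```python
from typing import Optional


def mask_token(token: Optional[str]) -> str:
    """Render a token for display without revealing prefix, suffix, or length."""
    if not token:
        return "(not set)"  # assumed label for a missing token
    # Same output for every token, so nothing about the value leaks.
    return "********"
```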

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…edential convention

- Drop the bespoke tmp-and-rename / 0600-permission helper used for
  ``deployment_chat_config.json``. The existing CLI credential store
  (``client/cli/v2/config_cmd.py``) writes plain TOML without atomic
  semantics or explicit permissions; the chat config now matches that
  convention rather than introducing a stricter parallel one.
- Introduce ``write_json_file`` in ``utils.py`` so the cache and config
  models share a single, plain ``mkdir`` + ``write_text`` helper.
- Drop the ``test_config_save_enforces_0600`` test along with the
  no-longer-needed ``os``/``stat`` imports.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…chat tests with aioresponses

- Collapse ``_read_payload``/``_ensure_dict`` into a single read-and-parse
  block inside ``_request``: parse ``resp.text()`` as JSON in one step,
  surface ``BackendAPIError`` (with the raw body in ``detail``) when the
  status is already a 4xx/5xx, and only raise ``BackendClientError``
  when a 2xx body is unparsable. The clarified comment now names
  Backend.AI's app-proxy as the layer that produces non-JSON 5xx pages.
- Remove ``--path`` from ``./bai deployment chat``. The CLI body is
  fixed to OpenAI-shaped ``{model, messages}`` via
  ``ChatCompletionRequest``, so a custom path never paired with a
  matching custom body — keeping the option encouraged the
  misconception that arbitrary inference contracts could be driven
  through this command. The SDK still accepts ``path`` as a kwarg for
  programmatic callers.
- Migrate ``test_deployment_chat_client.py`` from a real ``aiohttp.web``
  test server to ``aioresponses``-based mocks, matching the existing
  client-test convention (see ``tests/unit/client/test_resource_usage.py``).
  Headers and JSON body are asserted via ``m.requests``. New
  coverage: HTML 5xx now produces a ``BackendAPIError`` whose
  ``detail`` carries the upstream body verbatim.
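
The parse-then-classify flow in the first bullet can be sketched as follows. The two exception classes are local stand-ins for the SDK's types (their real constructors may differ), `parse_response` is a hypothetical name, and the treatment of a JSON-bodied 4xx/5xx and of non-object payloads is an assumption based on the commit messages in this PR.

```python
import json
from typing import Any


class BackendAPIError(Exception):
    """Stand-in for the SDK error type; the real signature may differ."""

    def __init__(self, status: int, detail: str) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status
        self.detail = detail


class BackendClientError(Exception):
    """Stand-in for the SDK's client-side error type."""


def parse_response(status: int, text: str) -> dict[str, Any]:
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        if status >= 400:
            # e.g. an HTML error page from the proxy: keep the body verbatim.
            raise BackendAPIError(status, detail=text) from None
        raise BackendClientError(f"unparsable body on HTTP {status}") from None
    if status >= 400:
        raise BackendAPIError(status, detail=text)
    if not isinstance(payload, dict):
        kind = type(payload).__name__
        raise BackendClientError(f"non-object payload (type={kind})")
    return payload
```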

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…ndAIAppProxyClient

- Add :class:`BackendAIAppProxyClient` to ``client/v2/base_client.py``:
  a ``ClientConfig``-driven base for SDK-side, direct-to-deployment
  HTTP traffic. It owns the aiohttp session, ``_request`` (with
  Bearer-token auth, app-proxy-aware JSON parsing, status-to-exception
  mapping), URL normalization, and the lifecycle hooks. The name is
  deliberately distinct from ``manager/clients/appproxy/client.py``'s
  ``AppProxyClient`` (control plane: coordinator admin API with
  ``X-BackendAI-Token``); this base sits in the SDK and handles the
  data plane (per-deployment Bearer-token traffic).
- Trim ``DeploymentChatClient`` to a single OpenAI Chat Completions
  method on top of the new base. Drop the ABC layer / separate
  ``OpenAICompatibleChatClient`` / ``DeploymentChatClientArgs`` /
  per-module timeout constants — those duties now live on
  ``BackendAIAppProxyClient`` and ``ClientConfig``. The path constant
  is renamed ``_OPENAI_COMPATIBLE_CHAT_PATH`` to make the contract
  explicit at the call site.
- Rename ``DeploymentChatAuthError`` → ``DeploymentAuthError`` since
  the 401/403 mapping now lives on the AppProxy base and is no longer
  chat-specific.
- Update the CLI to build a ``ClientConfig`` from
  ``V2ConnectionConfig`` and instantiate ``DeploymentChatClient``
  directly. Tests follow the same construction path.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…polish help text

- ``DeploymentChatCache.remove`` → ``pop`` and
  ``DeploymentChatConfig.clear_token`` → ``pop_token`` so the names match
  the underlying ``dict.pop`` semantics (return value indicates whether
  something was actually removed).
- Inline the ``TOKEN_PLACEHOLDER`` constant into ``mask_token`` — the
  literal only has one call site.
- Reword ``./bai deployment chat-config set --token`` help text:
  "Omit when the deployment is open to public" instead of the previous
  runtime-startup phrasing.
- Update tests for the renames.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…at/ subdirectory

Match the existing per-feature subdirectory layout used by ``./bai login``
(``~/.backend.ai/session/cookie.dat`` + ``session/config.json``):

- ``~/.backend.ai/deployment_chat.json`` →
  ``~/.backend.ai/deployment_chat/cache.json``
- ``~/.backend.ai/deployment_chat_config.json`` →
  ``~/.backend.ai/deployment_chat/config.json``

Drops the ``deployment_chat_`` filename prefix duplication and lets future
chat-related files (logs, sessions, etc.) land naturally under the same
directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… omitted

Previously ``./bai deployment chat`` errored out when neither ``--model``
nor a cached ``default_model`` was provided. Now the CLI calls
``GET /v1/models`` on the deployment's inference endpoint, picks the first
``id`` (matches webui ChatCard.tsx fallback), and caches it as the
deployment's ``default_model`` so subsequent ``chat`` calls reuse it.

Add ``DeploymentChatClient.list_models()`` returning a typed
``ListModelsResponse`` so the CLI consumes ``models_response.data[0].id``
instead of dict-drilling. Hoist the ``DeploymentAuthError`` handler to the
whole ``async with`` block (auth handling is identical for both
``/v1/models`` and ``/v1/chat/completions``) and drop the per-call
``BackendAPIError`` handlers — ``_run_async`` already formats them.
…under entry

Introduce ``DeploymentChatConfigEntry { token, model }`` so per-deployment
user state lives in one nested record (mirrors ``DeploymentChatCacheEntry``)
instead of two parallel ``tokens`` / ``models`` dicts.

Resolution order in ``chat`` becomes: ``--model`` flag > ``config.model``
(user-set, ``config.json``) > ``cache.default_model`` (auto, ``cache.json``)
> ``GET /v1/models[0].id`` (auto-fetched and cached). Both fields can
co-exist; the user-set value always wins, matching the user's
"config is the user's, cache is automatic" mental model.

CLI surface changes:
- Rename ``chat-config set --default-model`` to ``--model``; the flag now
  writes to ``config.json`` (user store) instead of ``cache.json`` (auto
  store), so the new name matches the field it sets.
- Drop the manager fetch from ``chat-config set`` — both token and model
  go to ``config.json`` only, so the command stays usable while the
  deployment is still provisioning or the manager is unreachable.
- Rename ``chat-config clear-config`` to ``chat-config clear``; clears the
  whole user config entry (token + model) for that deployment.
- Keep ``chat-config clear-cache`` for invalidating the auto-managed cache
  entry (``endpoint_url``, ``default_model``, ``last_synced_at``) on demand
  rather than waiting for the 24h TTL.
- ``chat-config show`` now prints both the user-set ``model`` and the
  auto-cached ``default_model`` so the resolved value is clear at a glance.
…es to one line

Replace the manual ``self.deployments.get(id) or DeploymentChatConfigEntry()``
+ ``self.deployments[id] = entry`` dance with a ``defaultdict``-backed store
so ``set_token`` / ``set_model`` reduce to a single bracket assignment.

Pydantic v2 cannot infer a default factory for a ``defaultdict`` whose value
is a ``BaseModel`` subclass, so the field annotation uses
``Annotated[..., Field(default_factory=...)]`` per the explicit form
``PydanticSchemaGenerationError`` directs callers to. Without it, importing
the module raises at class-construction time:

    Unable to infer a default factory for keys of type
    DeploymentChatConfigEntry. Only set, bool, str, tuple, dict, int,
    frozenset, float, list are supported, other types require an explicit
    default factory set using DefaultDict[..., Annotated[..., Field(
    default_factory=...)]]

Read paths (``get`` / ``get_token`` / ``get_model`` / ``pop_*``) still go
through ``dict.get`` / ``dict.pop`` so a missing-key lookup never plants a
stale empty entry.
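
The read/write asymmetry is standard `defaultdict` behavior and can be shown without Pydantic. This is a stdlib-only sketch; `Entry`, `set_token`, and `get_token` are simplified stand-ins for the PR's models and methods.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    token: Optional[str] = None
    model: Optional[str] = None


def set_token(store: defaultdict, deployment_id: str, token: str) -> None:
    # Write path: bracket access creates the entry on demand.
    store[deployment_id].token = token


def get_token(store: defaultdict, deployment_id: str) -> Optional[str]:
    # Read path: .get() never triggers the default factory, so a lookup
    # for an unknown id does not plant an empty entry.
    entry = store.get(deployment_id)
    return entry.token if entry is not None else None
```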
…onfig

The block was restating things the code already says (method names already
imply read vs write paths) and explaining pydantic boilerplate that the
import-time error message itself points at, so it was net noise.
Relocate the wire-format and persistence Pydantic models added in this
PR into the shared `common/` tree so any backend.ai component can
consume them, not just the CLI:

- OpenAI-compat wire DTOs (`ChatCompletionMessage`,
  `ChatCompletionRequest`, `ListModelsResponse`, `ModelEntry`) →
  `common/dto/clients/openai_compat/{request,response}.py`,
  paralleling the existing `common/dto/clients/prometheus/` layout for
  third-party HTTP service contracts.
- Chat persistence data types (`DeploymentChatCache(Entry)`,
  `DeploymentChatConfig(Entry)`, `CACHE_ENTRY_TTL`) →
  `common/data/deployment_chat/types.py` as pure Pydantic models with
  no I/O coupling.

Disk load/save lives in a new
`client/cli/v2/deployment/chat/storage.py` (`load_chat_cache`,
`save_chat_cache`, `load_chat_config`, `save_chat_config`) so the data
types stay free of `client.cli` imports — `common/` MUST NOT depend on
component-specific packages per `common/dto/CLAUDE.md`. The previous
`DeploymentChatCache.load`/`.save` classmethods that pulled in
`client.cli.v2.deployment.chat.utils` are removed in favor of these
free functions, eliminating the backward dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…e in

Reverse the relocation done in 69fd070:

- `DeploymentChatCache(Entry)` and `DeploymentChatConfig(Entry)` (and
  `CACHE_ENTRY_TTL`) move from `common/data/deployment_chat/` back to
  `client/cli/v2/deployment/chat/types.py`.
- `load_chat_*` / `save_chat_*` free functions go away; the
  corresponding `.load()` / `.save()` classmethods are restored on
  `DeploymentChatCache` / `DeploymentChatConfig`.
- `client/cli/v2/deployment/chat/storage.py` is removed — the typed
  models own their own disk format directly.

`common/dto/clients/openai_compat/{request,response}.py` (the
OpenAI-compat wire DTOs) are left in place, since those are reused by
the SDK and may grow more component consumers.
Address review feedback on PR #11344 — the OpenAI-compat chat endpoint
treats each turn as a "message" with role/content, so the user-facing
CLI argument is more naturally named `message`. Update the click
argument declaration, the function parameter, the help text, and the
request-body construction site.

The JSON key on the wire stays `content` (that's the OpenAI spec); only
the local variable / argument name changes.
…onfig

`chat-config show` was printing both the cache (auto-managed
`endpoint_url` / `default_model` / `last_synced_at`) and the user's
config (`token` / `model`) in one block, which blurred the
responsibility split between the two files.

Trim the command to print only the config entry it owns. Drop
``DeploymentChatFormatter.print_summary``/``entry_lines`` (the only
consumers of the cache half) in favor of a dedicated
``print_config(deployment_id, entry)``. Update the formatter test to
match.
…mand group

The auto-managed cache and the user-managed config are two separate
files (``cache.json`` vs ``config.json``); having a `clear-cache`
subcommand under `chat-config` mixed the two responsibilities.

Replace ``./bai deployment chat-config clear-cache`` with a dedicated
``chat-cache`` group:

- ``./bai deployment chat-cache show <id>`` — print the cached
  endpoint metadata (``endpoint_url``, ``default_model``,
  ``last_synced_at``) for inspection / debugging.
- ``./bai deployment chat-cache clear <id>`` — drop the cache entry,
  forcing the next ``chat`` call to refetch endpoint and re-derive the
  default model.

``DeploymentChatFormatter`` gains ``print_cache(deployment_id, entry)``
to render the cache view; the `chat-config clear` docstring is updated
to reference the new path.
…args

Address review feedback (PR #11412 discussion r3165318334) — the
deployment id values flowing through the chat data classes, the
formatter, and the click handlers represent a deployment, not a generic
UUID. Switch the static signatures to
``ai.backend.common.identifier.deployment.DeploymentID`` (a
``NewType(UUID)``) so type checkers can distinguish deployment ids
from other UUIDs without any runtime cost.

The click ``type=click.UUID`` parser still emits a plain ``UUID`` at
runtime; ``DeploymentID`` is structurally identical, so the wrap is
implicit and no conversion is needed at the boundary.
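
A short sketch of why the `NewType` wrapper is free at runtime. `DeploymentID` is redefined locally here rather than imported from `ai.backend.common.identifier.deployment`, and `cache_key` is a hypothetical consumer.

```python
from typing import NewType
from uuid import UUID

# Local stand-in for the real DeploymentID definition.
DeploymentID = NewType("DeploymentID", UUID)


def cache_key(deployment_id: DeploymentID) -> str:
    """Hypothetical consumer that only accepts deployment ids statically."""
    return str(deployment_id)


raw = UUID("d55e251a-3a70-408d-97a9-ca305502aa58")
# Calling the NewType returns its argument unchanged: no copy, no wrapper.
dep_id = DeploymentID(raw)
```

At runtime `dep_id` is the very same `UUID` object as `raw`; only the static type differs.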
…history None vs empty

- Extract `_OpenAICompatModel` base class so all OpenAI-compat response DTOs
  share a single `ConfigDict(extra="allow")` declaration instead of repeating
  it on each subclass.
- In `history_show`, distinguish "no history record" (`messages is None`) from
  the invariant-violating "record exists but empty list" case so the CLI
  message reflects the actual state instead of conflating both as falsy.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Python's `pop` convention (`dict.pop`, `list.pop`) implies the popped value
is returned, but these methods return a plain `bool` because every caller
only needs "did anything actually get removed?" Renaming so the method
names match what the calls do:

- `DeploymentChatCache.pop`        → `delete`        (removes the entry)
- `DeploymentChatConfig.pop`       → `delete`        (removes the entry)
- `DeploymentChatConfig.pop_token` → `clear_token`   (nulls the field, drops the entry only when both fields are unset)
- `DeploymentChatConfig.pop_model` → `clear_model`   (same shape as `clear_token`)

`pop_token`/`pop_model` were already misnomers — they null one field rather
than fully popping the entry, so `clear_*` reflects the actual behavior.
Return types stay `bool` since no caller uses the popped value.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…lient._request`

Address PR #11344 review: split JSON parsing and payload validation
into a dedicated method so `_request` only orchestrates the HTTP call
and status handling.
`PrometheusQueryPresetRepository.preview_template` was rewired to call
`PrometheusClient.preview_query_template` in #11274, but the component
test added in #11482 still mocked the now-unused `query_instant`. The
real client method falls through and returns an `AsyncMock`, so the
PrometheusResponse model fails validation and the API returns 500.

Mock the method actually called so the preview-endpoint tests cover
the success path and the FailedToGetMetric → PrometheusQueryEvaluationFailed
mapping again.
@jopemachine jopemachine force-pushed the feat/BA-5528-deployment-chat-cli branch from f269ef6 to 3e59aaf Compare May 6, 2026 05:00
@fregataa fregataa merged commit f7164ac into main May 6, 2026
33 checks passed
@fregataa fregataa deleted the feat/BA-5528-deployment-chat-cli branch May 6, 2026 07:33

Labels

comp:cli Related to CLI component comp:client Related to Client component comp:common Related to Common component size:XL 500~ LoC

4 participants